We’ve completed an initial analysis of all of the prediction files submitted in Round 1. Specifically, we performed a bootstrap analysis of each prediction file to assess the stability of each submission with respect to random subsets of the gold standard values. This analysis allowed us to calculate a Bayes factor for each submission relative to the top-performing submission. We consider any submission with a Bayes factor of 3 or less relative to the top performer to be an equally robust prediction.
Each prediction file was bootstrapped using paired random sampling with replacement: each file was resampled 10,000 times, and each resampled set was scored to generate a distribution of bootstrapped scores per prediction file. From these bootstrapped distributions, we calculated Bayes factors relative to the top performer.
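The procedure above can be sketched as follows. This is a minimal illustration, not the actual scoring code: the scoring metric (Pearson correlation) is a placeholder assumption, and the Bayes factor estimate shown (ratio of bootstrap iterations won vs. lost by the top performer) is one common challenge-style approach that we assume here.

```python
import numpy as np

def paired_bootstrap_indices(n, n_boot=10000, seed=0):
    """One shared matrix of resampled indices, so every submission is
    scored on identical bootstrap samples of the gold standard (paired)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n, size=(n_boot, n))

def bootstrap_scores(gold, pred, indices):
    """Score one prediction file on each bootstrap sample.
    Pearson correlation is a placeholder metric (an assumption)."""
    gold = np.asarray(gold, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.array([np.corrcoef(gold[i], pred[i])[0, 1] for i in indices])

def bayes_factor(top_scores, other_scores):
    """Estimate the Bayes factor of a submission relative to the top
    performer as the ratio of bootstrap iterations the top performer
    wins vs. loses (assumed estimator, not necessarily the one used)."""
    wins = int(np.sum(top_scores >= other_scores))
    losses = int(np.sum(top_scores < other_scores))
    return wins / max(losses, 1)
```

Because the index matrix is generated once and reused for every prediction file, each bootstrap iteration compares all submissions on the same resampled subset of the gold standard, which is what makes the sampling paired.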
The boxplots below summarize the bootstrapped values for each metric, with the actual leaderboard value (orange point) superimposed on each bootstrapped distribution. Boxes are ranked from highest to lowest performer, based on the single leaderboard value for each metric. Each boxplot is colored by Bayes factor: green boxes have a Bayes factor < 3 and are treated as equally-performing predictions, light blue is 3-5 (borderline equivalent), dark blue is 5-20, and black is > 20.
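The color scheme above amounts to a simple binning of Bayes factors. A minimal sketch of that mapping, assuming the boundary values (exactly 3, 5, or 20) fall into the higher-confidence bin, which the text does not specify:

```python
def bf_color(bf):
    """Map a Bayes factor to the boxplot color scheme described above.
    Edge handling at exactly 3, 5, and 20 is an assumption."""
    if bf < 3:
        return "green"      # equally-performing prediction
    if bf <= 5:
        return "lightblue"  # borderline equivalent
    if bf <= 20:
        return "darkblue"
    return "black"
```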
Each header below identifies the metric for that plot. The metrics are presented in the same order as the leaderboard; the ordering does not indicate a preference for one metric over another.